431 Class 21

Thomas E. Love, Ph.D.

2023-11-14

Today’s Agenda

Regression Assumptions (Chapter 30 in Course Notes)

  • A closer look
  • Calibrating yourself

On Contingency Tables (Chapter 28 in Course Notes)

  • Building a \(J \times K\) Table
  • Chi-Square Tests of Independence
    • Cochran Conditions and Checking Assumptions

Today’s Packages

library(vcd)           # for mosaic plots
library(janitor)       # clean_names(), tabyl(), etc. 
library(patchwork)     # combining ggplot2 plots
library(sessioninfo)   # for session_info()
library(tidyverse)

theme_set(theme_bw())

Residual Plots and Regression Assumptions

Multivariate Regression Assumptions

see Course Notes, Section 30

  • Linearity
  • Normality
  • Homoscedasticity
  • Independence

Available Residual Plots

plot(model, which = c(1:3,5))

  1. Residuals vs. Fitted Values
  2. Normal Q-Q Plot of Standardized Residuals
  3. Scale-Location Plot
  4. Index Plot of Cook’s Distance
  5. Residuals, Leverage and Influence

An Idealized Model (by Simulation)

set.seed(431122)

x1 <- rnorm(200, 20, 5)
x2 <- rnorm(200, 20, 12)
x3 <- rnorm(200, 20, 10)
er <- rnorm(200, 0, 1)
y <- .3*x1 - .2*x2 + .4*x3 + er

sim0 <- tibble(y, x1, x2, x3)

mod0 <- lm(y ~ x1 + x2 + x3, data = sim0)

summary(mod0) # shown on next slide

An Idealized Model (by Simulation)


Call:
lm(formula = y ~ x1 + x2 + x3, data = sim0)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.14553 -0.68079  0.08096  0.69216  2.65265 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.122852   0.348584   0.352    0.725    
x1           0.285539   0.014211  20.093   <2e-16 ***
x2          -0.204908   0.005828 -35.159   <2e-16 ***
x3           0.413308   0.007172  57.631   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.007 on 196 degrees of freedom
Multiple R-squared:  0.9589,    Adjusted R-squared:  0.9583 
F-statistic:  1524 on 3 and 196 DF,  p-value: < 2.2e-16

Residual Plots for Idealized Model

par(mfrow=c(2,2)); plot(mod0); par(mfrow=c(1,1))

shown on next page (note: #| fig-height: 7 is helpful)

  • Residuals vs. Fitted values (Top Left)
  • Normal Q-Q plot of Standardized Residuals (Top Right)
  • Scale-Location plot (Bottom Left)
  • Residuals vs. Leverage, with Cook’s Distance contours (Bottom Right)

Regression assumptions violated?

  • Non-linearity problems show up as a curve in Residuals vs. Fitted plot
  • Heteroscedasticity problems show up as a fan in the Residuals vs. Fitted plot, and show up as a trend (up or down) in the Scale-Location plot
  • Non-Normality problems show up as outliers (all plots)
    • Normal Q-Q plot of standardized residuals
    • Bottom Right plot shows each point’s residual, leverage & (if problematic) influence

What to Do?

Importance of Assumptions (1-3):

  1. Linearity (critical, but amenable to transformations, often)
  2. Independence (critical, not relevant if data are a cross-section with no meaningful ordering in space or time, but vitally important if space/time play a meaningful role - longitudinal data analysis required)
  3. Homoscedasticity (constant variance: important, sometimes amenable to transformation)

What to Do?

Importance of Assumptions (4-6):

  4. Normality due to skew (usually amenable to transformation)
  5. Normality due to many more outliers than we would expect (heavy-tailed - inference is problematic unless you account for this, sometimes a transformation can help)
  6. Normality due to a severe outlier (or a small number of severely poorly fitted points - can consider setting those points aside from the modeling, but this requires a meaningful external explanation)

What about collinearity?

“No collinearity” is not a regression assumption, but if we see substantial collinearity, we are inclined to consider dropping some of the variables, or combining them (height and weight may be highly correlated, height and BMI may be less so).

The variance inflation factor (or VIF), if it exceeds 5, is a clear indication of collinearity. We’d like to see the variances inflated only slightly by correlation between the predictors (that is, VIF not much larger than 1), to facilitate interpretation.
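As a reminder of where the VIF comes from, here is a minimal sketch that computes each VIF directly as \(1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on the other predictors. (In practice you would likely use `car::vif()`; the data below are simulated just for this illustration, with `x2` deliberately built to be correlated with `x1`.)

```r
set.seed(431)
dat <- data.frame(x1 = rnorm(200), x3 = rnorm(200))
dat$x2 <- dat$x1 + rnorm(200, 0, 0.4)   # x2 built to be correlated with x1

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(setdiff(predictors, p), response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_by_hand(dat, c("x1", "x2", "x3"))
# x1 and x2 show noticeably inflated VIFs; x3 stays near 1
```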

What about collinearity?

The best way to tell whether an alternative model has improved the situation is to fit both models and compare them, looking in particular at:

  • the standard errors of their coefficients, and
  • their VIFs.

Resolving Assumption Violations

Options include:

  • transform the Y variable, likely with one of our key power transformations (use Box-Cox to help)
  • transform one or more of the X variables if it seems particularly problematic, or perhaps combine them (rather than height and weight, perhaps look at BMI, or BMI and height to help reduce collinearity)
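A small sketch of using Box-Cox to suggest a power transformation for Y, via `boxcox()` from the MASS package (which ships with R). The data here are simulated so that `log(y)` is linear in `x`, so the procedure should point toward the log transformation (\(\lambda\) near 0):

```r
set.seed(431)
x <- rnorm(100, 10, 2)
y <- exp(0.25 * x + rnorm(100, 0, 0.2))   # y > 0, right-skewed outcome

# boxcox() profiles the log-likelihood over a grid of lambda values;
# using MASS:: directly avoids masking dplyr::select() in a tidyverse session
bc <- MASS::boxcox(lm(y ~ x), plotit = FALSE)
(lambda <- bc$x[which.max(bc$y)])   # should land near 0, suggesting log(y)
```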

Resolving Assumption Violations

Options include:

  • remove a point only if you have a good explanation for the point that can be provided outside of the modeling, and this is especially important if the point is influential
  • consider other methods for establishing a non-linear model (432: splines, loess smoothers, non-linear modeling)
  • consider other methods for longitudinal data with substantial dependence (432)

Six Simulations To Help You Calibrate Yourself

For each simulation, decide:

Is one of the regression assumptions violated?

  • Linearity, Homoscedasticity, Normality, or multiple problems?
    • All of these simulations describe cross-sectional data, with no importance to the order of the observations, so the assumption of independence isn’t a concern.
  • In which of the four plot(s) shown do you see the problem?

For each simulation, decide:

  • If you see a point that is problematic, then:
    • is it poorly fit?
    • is it highly leveraged?
    • is it influential?
  • What might you try to do about the assumption problem you see (if you see one), to resolve it?

This isn’t easy. We’ll do three, and then regroup.

OK. How are we doing so far?

The First Three Simulations

For those of you playing along at home…

  1. Observation 1 has an impossibly large standardized residual (Z score is close to 12), of substantial influence (Cook’s distance around 0.7).
    • Probably need to remove the point, and explain it separately.

The First Three Simulations

  2. Curve in residuals vs. fitted values plot suggests potential non-linearity.
    • Natural choice would be a transformation of the outcome.
  3. No substantial problems, although there’s a little bit of heteroscedasticity.
    • I’d probably just go with the model as is.

Let’s try three more…

The Last Three Simulations

  4. Normality issues - outlier-prone even with 1000 observations.
    • Transform Y? Consider transforming the Xs?
  5. Serious heteroscedasticity - residuals much more varied for larger fitted values.
    • Look at Residuals vs. each individual X to see if this is connected to a specific predictor, which might be skewed or something?

The Last Three Simulations

  6. No serious violations - point 496 has very substantial leverage, though.
    • I’d probably just go with the model as is, after making sure that point 496’s X values aren’t incorrect.

What’s the Goal Here?

Develop an effective model. (?) (!)

  • Models can do many different things. What you’re using the model for matters, a lot.
  • Don’t fall into the trap of making binary decisions (this model isn’t perfect, no matter what you do, and so your assessment of residuals will also have shades of gray).
  • The tools we have provided (scatterplots, mostly) are well designed for rather modest sample sizes. When you have truly large samples, they don’t scale very well.

Developing effective models

  • Just because R chooses four plots for you to study doesn’t mean they provide the only relevant information.
  • Embrace the uncertainty. Look at the process of checking assumptions as an opportunity to study your data more effectively.

Working with Larger Cross-Tabulations

A \(2 \times 3\) contingency table

This table displays the count of patients who show complete, partial, or no response after treatment with either active medication or a placebo in a study of 100 patients…

Group     None   Partial   Complete
Active       8        24         20
Placebo     12        26         10

Is there a statistically detectable association here, at \(\alpha = 0.10\)?

The Pearson Chi-Square Test

  • \(H_0\): Response Distribution is the same, regardless of Treatment.
  • \(H_A\): There is an association between Treatment and Response.

The Pearson \(\chi^2\) test assumes the null hypothesis is true (rows and columns are independent). That is a model for our data. How does it work?

Calculating Chi-Square

Here’s the table, with marginal totals added.

          None   Partial   Complete   TOTAL
Active       8        24         20      52
Placebo     12        26         10      48
TOTAL       20        50         30     100

The test needs to estimate the expected frequency in each of the six cells under the assumption of independence. If the rows and columns were independent, what is the expected count in the Active/None cell?

The Independence Model

          None   Partial   Complete   TOTAL
Active                                   52
Placebo                                  48
TOTAL       20        50         30     100

If the rows and columns were independent, then:

  • 20/100 of subjects would have response = “None”
    • That’s 20% of the 52 Active, and 20% of the 48 Placebo
  • 50% would have a “Partial” response, and
  • 30% would have a “Complete” response in each group.

Observed (Expected) Cell Counts

So, can we fill in the expected frequencies under our independence model?

          None        Partial      Complete     TOTAL
Active     8 (10.4)   24 (26.0)    20 (15.6)      52
Placebo   12 (9.6)    26 (24.0)    10 (14.4)      48
TOTAL     20          50           30            100

General Formula for Expected Frequencies under Independence

\[ \mbox{Expected Frequency} = \frac{\mbox{Row total} \times \mbox{Column total}}{\mbox{Grand Total}} \]

This assumes that the independence model holds: the probability of being in a particular column is exactly the same in each row, and vice versa.
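This formula can be checked by hand in R with `outer()`, which builds the full matrix of (row total \(\times\) column total) products in one step. This sketch reproduces the expected counts from the table above and then computes the Pearson statistic as \(\sum (O - E)^2 / E\):

```r
# Observed counts for the 2x3 treatment-response table
observed <- matrix(c(8, 24, 20, 12, 26, 10), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Active", "Placebo"),
                                   c("None", "Partial", "Complete")))

# Expected = (row total * column total) / grand total, for every cell at once
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected                                  # matches the expected counts above

sum((observed - expected)^2 / expected)   # Pearson chi-square: about 4.06
pchisq(sum((observed - expected)^2 / expected),
       df = (2 - 1) * (3 - 1), lower.tail = FALSE)   # p about 0.131
```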

Chi-Square Assumptions

  • Expected Frequencies: We assume that the expected frequency, under the null hypothesized model of independence, will be at least 5 (and ideally at least 10) in each cell. If that is not the case, then the \(\chi^2\) test is likely to give unreliable results.
  • The Cochran conditions require us to have no cells with zero counts and at least 80% of the cells in our table with expected counts of 5 or higher. That’s what R uses to warn you of trouble.
  • Don’t meet the standards? Consider collapsing categories.
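The conditions above lend themselves to a quick programmatic check. Here is a sketch using the same 2x3 table from this example: no empty observed cells, and at least 80% of cells with expected counts of 5 or more.

```r
tab <- matrix(c(8, 24, 20, 12, 26, 10), nrow = 2, byrow = TRUE)
exp_counts <- outer(rowSums(tab), colSums(tab)) / sum(tab)

all(tab > 0)                      # no cells with zero observed counts?
mean(exp_counts >= 5) >= 0.8      # at least 80% of expected counts >= 5?
```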

Observed (Expected) Cell Counts

          None        Partial      Complete     TOTAL
Active     8 (10.4)   24 (26.0)    20 (15.6)      52
Placebo   12 (9.6)    26 (24.0)    10 (14.4)      48
TOTAL     20          50           30            100
  • Do we meet the Cochran conditions in this case?

Getting the Table into R

We’ll put the table into a matrix in R. Here’s one approach…

T1 <- matrix(c(8, 24, 20, 12, 26, 10), 
             ncol=3, nrow=2, byrow=TRUE)
rownames(T1) <- c("Active", "Placebo")
colnames(T1) <- c("None", "Partial", "Complete")
T1
        None Partial Complete
Active     8      24       20
Placebo   12      26       10
chisq.test(T1)

    Pearson's Chi-squared test

data:  T1
X-squared = 4.0598, df = 2, p-value = 0.1313

Chi-Square Test Results in R

  • \(H_0\): Response Distribution is the same, regardless of Treatment.
    • Rows and Columns of the table are independent
  • \(H_A\): There is an association between Treatment and Response.
    • Rows and Columns of the table are associated.
  • For our T1, the results were: \(\chi^2\) = 4.0598, df = 2, p = 0.1313

What is the conclusion?

Does Sample Size Affect The \(\chi^2\) Test?

  • T1 results were: \(\chi^2\) = 4.0598, df = 2, p = 0.1313
  • What if we had the same pattern, but twice as much data?
T1_doubled <- T1*2
T1_doubled
        None Partial Complete
Active    16      48       40
Placebo   24      52       20
chisq.test(T1_doubled)

    Pearson's Chi-squared test

data:  T1_doubled
X-squared = 8.1197, df = 2, p-value = 0.01725
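As the doubled table shows, with the same cell proportions the Pearson statistic grows linearly with the sample size, so the same pattern becomes "statistically detectable" with enough data. A one-line check confirms the statistic exactly doubles:

```r
T1 <- matrix(c(8, 24, 20, 12, 26, 10), nrow = 2, byrow = TRUE)

# No continuity correction is involved here (it applies only to 2x2 tables)
unname(chisq.test(T1)$statistic)        # about 4.06
unname(chisq.test(T1 * 2)$statistic)    # about 8.12 = 2 * 4.06
```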

Fisher’s exact test instead?

Yes, but … if the Pearson assumptions don’t hold, then Fisher’s test is not generally an improvement.

fisher.test(T1)

    Fisher's Exact Test for Count Data

data:  T1
p-value = 0.1358
alternative hypothesis: two.sided
  • It’s also really meant more for square tables, with the same number of rows as columns, and relatively modest sample sizes.

Example: dm1000 (see Classes 8-9)

dm1000 <- read_rds("c21/data/dm_1000.Rds") |>
    select(subject, tobacco, insurance) |>
    drop_na()

head(dm1000)
# A tibble: 6 × 3
  subject tobacco insurance 
  <chr>   <fct>   <fct>     
1 M-0001  Current Medicaid  
2 M-0002  Never   Commercial
3 M-0003  Former  Medicare  
4 M-0004  Never   Medicaid  
5 M-0005  Never   Medicare  
6 M-0006  Current Medicaid  

Arrange the Factors in a Useful Order

dm1000 <- dm1000 |>
    mutate(tobacco = 
               fct_relevel(tobacco, "Current", "Former"),
           insurance = 
               fct_relevel(insurance, "Medicare", 
                           "Commercial", "Medicaid"))

dm1000 |> tabyl(tobacco, insurance) |> 
    adorn_totals(where = c("row", "col"))
 tobacco Medicare Commercial Medicaid Uninsured Total
 Current       99         44      118        13   274
  Former      183         70      103        11   367
   Never      140         80      105        18   343
   Total      422        194      326        42   984

dm1000: Two Categorical Variables of interest

p1 <- ggplot(dm1000, aes(x = insurance)) + geom_bar() + 
    geom_text(aes(label = after_stat(count)), stat = "count", 
              vjust = 1.5, col = "white")

p2 <- ggplot(dm1000, aes(x = tobacco)) + geom_bar() + 
    geom_text(aes(label = after_stat(count)), stat = "count", 
              vjust = 1.5, col = "white")

p1 + p2 

dm1000: Two Categorical Variables of interest

A \(4 \times 3\) table with the dm1000 data

dm1000 |> 
    tabyl(insurance, tobacco) |>
    adorn_totals(where = c("row", "col"))
  insurance Current Former Never Total
   Medicare      99    183   140   422
 Commercial      44     70    80   194
   Medicaid     118    103   105   326
  Uninsured      13     11    18    42
      Total     274    367   343   984

Plotting a Cross-Tabulation?

ggplot(dm1000, aes(x = insurance, y = tobacco)) +
    geom_count() 

Tobacco Bar Chart faceted by Insurance

ggplot(dm1000, aes(x = tobacco, fill = tobacco)) + 
    geom_bar() + facet_wrap(~ insurance) +
    guides(fill = "none") + 
    geom_text(aes(label = after_stat(count)), stat = "count", 
              vjust = 1, col = "black")

Tobacco Bar Chart faceted by Insurance

Tobacco Status and Insurance

  • \(H_0\): Insurance type and Tobacco status are independent
  • \(H_A\): Insurance type and Tobacco status are associated

Pearson \(\chi^2\) results?

dm1000 |> tabyl(insurance, tobacco) |> chisq.test()

    Pearson's Chi-squared test

data:  tabyl(dm1000, insurance, tobacco)
X-squared = 25.592, df = 6, p-value = 0.0002651

Can we check our expected frequencies?

Checking Expected Frequencies

res <- dm1000 |> tabyl(insurance, tobacco) |> chisq.test()

res$observed
  insurance Current Former Never
   Medicare      99    183   140
 Commercial      44     70    80
   Medicaid     118    103   105
  Uninsured      13     11    18
res$expected
  insurance   Current    Former     Never
   Medicare 117.50813 157.39228 147.09959
 Commercial  54.02033  72.35569  67.62398
   Medicaid  90.77642 121.58740 113.63618
  Uninsured  11.69512  15.66463  14.64024

Any problems with Cochran conditions?

Mosaic Plot for Cross-Tabulation

Each rectangle’s area is proportional to the number of cases in that cell.

plot(dm1000$insurance, dm1000$tobacco, ylab = "", xlab = "")

Mosaic Plot from the vcd package (highlighting)

mosaic(~ tobacco + insurance, data = dm1000, 
       highlighting = "tobacco", 
       highlighting_fill = c("red", "gray50", "white"))

Mosaic Plot from the vcd package (with \(\chi^2\) shading)

mosaic(~ tobacco + insurance, data = dm1000, shade = TRUE)

Notes

  • I will get Project A back to you ASAP.
  • The Minute Paper originally scheduled for tomorrow is canceled.
  • Project B Registration Form is due at 9 AM Thursday 2023-11-16.

Session Information

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2023-11-14
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.1)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.3   2023-09-03 [1] CRAN (R 4.3.1)
 evaluate      0.22    2023-09-29 [1] CRAN (R 4.3.1)
 fansi         1.0.5   2023-10-08 [1] CRAN (R 4.3.1)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.1)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.1)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
 gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.1)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.1)
 htmltools     0.5.6.1 2023-10-06 [1] CRAN (R 4.3.1)
 janitor     * 2.2.0   2023-02-02 [1] CRAN (R 4.3.1)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.1)
 knitr         1.44    2023-09-11 [1] CRAN (R 4.3.1)
 labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.1)
 lattice       0.21-8  2023-04-05 [2] CRAN (R 4.3.1)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
 lmtest        0.9-40  2022-03-21 [1] CRAN (R 4.3.1)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
 MASS          7.3-60  2023-05-04 [2] CRAN (R 4.3.1)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.1)
 patchwork   * 1.1.3   2023-08-14 [1] CRAN (R 4.3.1)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.1)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.3.1)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
 snakecase     0.11.1  2023-08-27 [1] CRAN (R 4.3.1)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.1)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.1)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.1)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.1)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.1)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.1)
 vcd         * 1.4-11  2023-02-01 [1] CRAN (R 4.3.1)
 vctrs         0.6.4   2023-10-12 [1] CRAN (R 4.3.1)
 withr         2.5.1   2023-09-26 [1] CRAN (R 4.3.1)
 xfun          0.40    2023-08-09 [1] CRAN (R 4.3.1)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
 zoo           1.8-12  2023-04-13 [1] CRAN (R 4.3.1)

 [1] C:/Users/thoma/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.1/library

──────────────────────────────────────────────────────────────────────────────